Unsupervised Identification of Translationese

نویسندگان

  • Ella Rabinovich
  • Shuly Wintner
چکیده

Translated texts are distinctively different from original ones, to the extent that supervised text classification methods can distinguish between them with high accuracy. These differences were proven useful for statistical machine translation. However, it has been suggested that the accuracy of translation detection deteriorates when the classifier is evaluated outside the domain it was trained on. We show that this is indeed the case, in a variety of evaluation scenarios. We then show that unsupervised classification is highly accurate on this task. We suggest a method for determining the correct labels of the clustering outcomes, and then use the labels for voting, improving the accuracy even further. Moreover, we suggest a simple method for clustering in the challenging case of mixed-domain datasets, in spite of the dominance of domainrelated features over translation-related ones. The result is an effective, fully-unsupervised method for distinguishing between original and translated texts that can be applied to new domains with reasonable accuracy.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Parallel Corpus of Translationese

We describe a set of bilingual English–French and English–German parallel corpora in which the direction of translation is accurately and reliably annotated. The corpora are diverse, consisting of parliamentary proceedings, literary works, transcriptions of TED talks and political commentary. They will be instrumental for research of translationese and its applications to (human and machine) tr...

متن کامل

Translationese: Between Human and Machine Translation

Translated texts, in any language, have unique characteristics that set them apart from texts originally written in the same language. Translation Studies is a research field that focuses on investigating these characteristics. Until recently, research in machine translation (MT) has been entirely divorced from translation studies. The main goal of this tutorial is to introduce some of the find...

متن کامل

A clustering approach for translationese identification

Our paper is concerned with investigating the impact of translationese on the novels of a bilingual writer and asking whether one could determine the authorship of a translated document. The main part of our paper will be centered on selecting a good set of lexical features that can be considered characteristic for an author. We used in our research the novels of Vladimir Nabokov, a bilingual a...

متن کامل

Identification of Translationese: A Machine Learning Approach

This paper presents a machine learning approach to the study of translationese. The goal is to train a computer system to distinguish between translated and non-translated text, in order to determine the characteristic features that influence the classifiers. Several algorithms reach up to 97.62% success rate on a technical dataset. Moreover, the SVM classifier consistently reports a statistica...

متن کامل

A New Approach to the Study of Translationese: Machine-learning the Difference between Original and Translated Text

In this paper we describe an approach to the identification of “translationese” based on monolingual comparable corpora and machine learning techniques for text categorization. The paper reports on experiments in which support vector machines (SVMs) are employed to recognize translated text in a corpus of Italian articles from the geopolitical domain. An ensemble of SVMs reaches 86.7% accuracy ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • TACL

دوره 3  شماره 

صفحات  -

تاریخ انتشار 2015